Abstract

The chain-structured long short-term memory (LSTM) has been shown to be effective in a wide range of problems such as speech recognition and machine translation. In this paper, we propose to extend it to tree structures, in which a memory cell can reflect the history memories of multiple child cells or multiple descendant cells in a recursive process. We call the model S-LSTM, which provides a principled way of considering long-distance interaction over hierarchies, e.g., language or image parse structures. We leverage the model for semantic composition to understand the meaning of text, a fundamental problem in natural language understanding, and show that it outperforms a state-of-the-art recursive model when that model's composition layers are replaced with S-LSTM memory blocks. We also show that utilizing the given structures helps achieve better performance than ignoring them.


1. Introduction 


Recent years have seen a revival of the long short-term memory (LSTM) (Hochreiter & Schmidhuber, 1997), with its effectiveness being demonstrated on a wide range of problems such as speech recognition (Graves et al., 2013), machine translation (Sutskever et al., 2014; Cho et al., 2014), and image-to-text conversion (Vinyals et al., 2014), among many others, in which history is summarized and coded in the memory cell in a full-order time sequence.

On February 6th, 2015, this work was submitted to the International Conference on Machine Learning (ICML).

Recursion is a fundamental process associated with many problems; a recursive process and the hierarchical structure so formed are common in different modalities. For example, the semantics of sentences in human languages is believed to be carried by not merely a linear concatenation of words; instead, sentences have parse structures (Manning & Schütze, 1999). Image understanding, as another example, benefits from recursive modeling over structures, which yielded state-of-the-art performance on tasks like scene segmentation (Socher et al., 2011).


In this paper, we extend LSTM to tree structures, in which we learn memory cells that can reflect the history memories of multiple child cells and hence multiple descendant cells. We call the model S-LSTM. Compared with previous recursive neural networks (Socher et al., 2013; 2012), S-LSTM has the potential to avoid gradient vanishing and hence may model long-distance interaction over trees. This is a desirable characteristic, as many such structures are deep. S-LSTM can be considered as bringing together the merits of a recursive neural network and a recurrent neural network.* In short, S-LSTM wires memory blocks in a partial-order structure instead of in a full-order sequence as in a chain-structured LSTM.


We leverage the S-LSTM model to solve a semantic composition problem that learns the meaning of a piece of text; learning good representations for the meaning of text is core to automatically understanding human languages. More specifically, we experiment with the models on the Stanford Sentiment Tree Bank (Socher et al., 2013) to determine the sentiment for different granularities of phrases in a tree. The dataset has favorable properties: in addition to being a benchmark for much previous work, it provides human annotations at all nodes of the trees, enabling us to comprehensively explore the properties of S-LSTM. We experimentally show that S-LSTM outperforms a state-of-the-art recursive model by simply replacing the original tensor-enhanced composition with the S-LSTM memory block we propose here. We also show that utilizing the given structures helps achieve better performance than ignoring them.

*As both of them can be shortened to RNN, in the rest of this paper we refer to a Recurrent Neural Network as RNN and a Recursive Neural Network as RvNN.


2. Related Work 


Recursive neural networks. Recursion is a fundamental process in different modalities. In recent years, recursive neural networks (RvNN) have been introduced and demonstrated to achieve state-of-the-art performance on different problems such as semantic analysis in natural language processing and image segmentation (Socher et al., 2013; 2011). These networks are defined over recursive tree structures: a tree node is a vector computed from its children. In a recursive fashion, the information from the leaf nodes of a tree and its internal nodes is combined in a bottom-up manner through the tree. Derivatives of errors are computed with backpropagation over structures (Goller & Küchler, 1996).
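The bottom-up composition described above can be illustrated with a minimal sketch. It assumes the plain RvNN composition, a single weight matrix applied to the concatenated child vectors followed by tanh; the tuple-based tree encoding, the vector size, and all names (`rvnn_compose`, `encode`, `embeddings`) are hypothetical choices made for illustration, not taken from the paper.

```python
import numpy as np

def rvnn_compose(h_left, h_right, W, b):
    """Parent vector from two child vectors: tanh(W [h_left; h_right] + b)."""
    children = np.concatenate([h_left, h_right])
    return np.tanh(W @ children + b)

def encode(tree, embeddings, W, b):
    """Bottom-up encoding; a tree is a word (leaf) or a (left, right) pair."""
    if isinstance(tree, str):
        return embeddings[tree]
    left, right = tree
    return rvnn_compose(encode(left, embeddings, W, b),
                        encode(right, embeddings, W, b), W, b)

rng = np.random.default_rng(0)
d = 4  # hypothetical vector size
embeddings = {w: rng.normal(size=d) for w in ["not", "very", "good"]}
W = rng.normal(scale=0.1, size=(d, 2 * d))
b = np.zeros(d)

# Encode the phrase "not (very good)" from the leaves up to the root.
root = encode(("not", ("very", "good")), embeddings, W, b)
```

Every internal node reuses the same `W` and `b`, which is why errors must be propagated back through the whole tree structure during training.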


In addition, the literature includes many other efforts to apply feedforward-based neural networks over structures, including (Goller & Küchler, 1996; Chater, 1992; Starzyk et al.; Hammer et al., 2004), amongst others. For instance, Legrand and Collobert leverage neural networks over greedy syntactic parsing (Pinheiro & Collobert, 2014). In (Irsoy & Cardie, 2014), a deep recursive neural network is proposed. Nevertheless, over the often deep structures, the networks are potentially subject to the vanishing gradient problem, resulting in difficulties in leveraging long-distance dependencies in the structures. In this paper, we propose the S-LSTM model, which wires memory blocks in recursive structures. We compare our model with the RvNN models presented in (Socher et al., 2013), as we directly replaced the tensor-enhanced composition layer at each tree node with an S-LSTM memory block. We show the advantages of our proposed model in achieving significantly better results.


Recurrent neural networks and LSTM. Unlike a feedforward network, a recurrent neural network (RNN) shares its hidden states across time. The sequential history is summarized in a hidden vector. RNN also suffers from the decaying of gradient or, less frequently, the blowing-up of gradient. LSTM replaces the hidden vector of a recurrent neural network with memory blocks equipped with gates; it can in principle keep long-term memory by training proper gating weights (refer to (Graves, 2008) for intuitive illustrations and good discussions), and it has been shown in practice to be very useful, achieving the state of the art on a range of problems including speech recognition (Graves et al., 2013) and digit handwriting recognition (Liwicki et al., 2007; Graves, 2012), and achieving interesting results on statistical machine translation (Sutskever et al., 2014; Cho et al., 2014) and music composition (Eck & Schmidhuber, 2002b;a). In (Graves et al., 2013), a deep LSTM network achieved state-of-the-art results on the TIMIT phoneme recognition benchmark. In (Sutskever et al., 2014; Cho et al., 2014), a pair of LSTM networks are trained to encode and decode human language for automatic machine translation, which is particularly effective for the more challenging translation of long sentences. In (Liwicki et al., 2007; Graves, 2012), LSTM networks are found to be very useful for digit handwriting recognition because of the network's capability of memorizing context information in a long sequence. In (Eck & Schmidhuber, 2002b;a), LSTM networks are trained to effectively capture global structures of temporal data. With the memory cells, LSTM is able to keep track of temporally distant events that indicate global music structures. As a result, LSTM can be successfully trained to compose music, where other RNNs have failed.


Although promising results have been observed by applying chain-structured LSTM, many other interesting problems are inherently associated with input structures that are more complicated than a sequence. For example, the meaning of sentences in human languages is believed to be carried by not merely a linear sequence of words; instead, meaning is thought to interweave with structures. While a sequential application of LSTM may capture structural information implicitly, in practice it sometimes lacks the claimed power. For example, even simply reversing the input sequences may result in significant differences in modeling performance, in tasks such as machine translation and speech recognition. Unlike previous work, we propose here to directly wire memory blocks in recursive structures. We show that the proposed S-LSTM model does utilize the structures and achieves better results than models ignoring such a priori structures.


3. The Model 

Model brief. In this paper, we extend LSTM to structures, in which a memory cell can reflect the history memories of multiple child cells and hence multiple descendant cells in a hierarchical structure. As illustrated in Figure 1, the root of the tree can in principle consider information from long-distance interactions over the tree; in this figure, the gray and light-blue leaves. In the figure, the small circle ("o") or short line at each arrowhead indicates pass or block of information, respectively. Note that the figure shows a binary case, while in real models a soft version of gating is applied, where a gating signal is in the range of [0, 1], often enforced with a logistic sigmoid function. Through learning the gating signals, as detailed later in this section, S-LSTM provides a principled way of considering long-distance interplays over the input structures.



Figure 1. An example of S-LSTM, a long short-term memory network on tree structures. A tree node can consider information from multiple descendants. Information from the other nodes, shown in white, is blocked. The small circle ("o") or short line at each arrowhead indicates a pass or block of information, respectively, while in the real model the gating is soft.


The memory block. Each node in Figure 1 is composed of an S-LSTM memory block. We present a specific wiring of such a block in Figure 2. Each memory block contains one input gate and one output gate. The number of forget gates depends on the structure, i.e., the number of children of a node. In this paper, we assume there are two children at each node, the same as in (Socher et al., 2013), and therefore we use their data in our experiments. That is, we have two forget gates. Extension of the model to handle more children is rather straightforward.


Figure 2. An S-LSTM memory block, consisting of an input gate, two forget gates, and an output gate. Hidden vectors $h_{t-1}$ and cell vectors $c_{t-1}$ from the left (red arrows) and right (blue arrows) children are deployed to compute $c_t$ and $h_t$. $\otimes$ denotes a Hadamard product, and the "s"-shaped sign is a squashing function (in this paper the tanh function).

As shown in the figure, the hidden vectors of the two children, denoted as $h^L_{t-1}$ for the left child and $h^R_{t-1}$ for the right, are taken in as input of the current block. The input gate $i_t$ considers four sources of information: the hidden vectors ($h^L_{t-1}$ and $h^R_{t-1}$) and cell vectors ($c^L_{t-1}$ and $c^R_{t-1}$) of its two children. These four sources of information are also used to form the gating signals for the left forget gate $f^L_t$ and right forget gate $f^R_t$, where the weights used to combine them are specific to each of these gates, denoted as different $W$ in the formulas below. Different from the process in a regular LSTM, the cell here considers the copies from both children's cell vectors ($c^L_{t-1}$, $c^R_{t-1}$), gated with separate forget gates. The left and right forget gates can be controlled independently, allowing the pass-through of information from the children's cell vectors. The output gate $o_t$ considers the hidden vectors from the children and the current cell vector. In turn, the hidden vector $h_t$ and the cell vector $c_t$ of the current block are passed to the parent and are used depending on whether the current block is the left or right child of its parent. In this way, the memory cell, through merging the gated cell vectors of the children, can reflect multiple direct or indirect descendant cells. As a result, the long-distance interplays over the structures can be captured. More specifically, the forward computation of an S-LSTM memory block is specified in the following equations.


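$$i_t = \sigma(W^L_{hi} h^L_{t-1} + W^R_{hi} h^R_{t-1} + W^L_{ci} c^L_{t-1} + W^R_{ci} c^R_{t-1} + b_i) \qquad (1)$$

$$f^L_t = \sigma(W^L_{hf_l} h^L_{t-1} + W^R_{hf_l} h^R_{t-1} + W^L_{cf_l} c^L_{t-1} + W^R_{cf_l} c^R_{t-1} + b_{f_l}) \qquad (2)$$

$$f^R_t = \sigma(W^L_{hf_r} h^L_{t-1} + W^R_{hf_r} h^R_{t-1} + W^L_{cf_r} c^L_{t-1} + W^R_{cf_r} c^R_{t-1} + b_{f_r}) \qquad (3)$$

$$x_t = W^L_{hx} h^L_{t-1} + W^R_{hx} h^R_{t-1} + b_x \qquad (4)$$

$$c_t = f^L_t \otimes c^L_{t-1} + f^R_t \otimes c^R_{t-1} + i_t \otimes \tanh(x_t) \qquad (5)$$

$$o_t = \sigma(W^L_{ho} h^L_{t-1} + W^R_{ho} h^R_{t-1} + W_{co} c_t + b_o) \qquad (6)$$

$$h_t = o_t \otimes \tanh(c_t) \qquad (7)$$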

where $\sigma$ is the element-wise logistic function used to confine the gating signals to the range of [0, 1]; $f^L_t$ and $f^R_t$ are the left and right forget gates, respectively; $b$ denotes bias vectors and $W$ network weight matrices; the sign $\otimes$ is a Hadamard product, i.e., element-wise product. The subscripts of the weight matrices indicate what they are used for. For example, $W_{ho}$ is a matrix mapping a hidden vector to an output gate.
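The forward computation in Equations (1)-(7) can be sketched in a few lines of NumPy. This is a minimal illustration for the binary case, not the authors' implementation: the parameter layout (a dictionary `P` keyed by gate), the initialization scale, and the function names are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(d, rng):
    """Hypothetical layout: P[gate] holds the per-child matrices and a bias."""
    def mat():
        return rng.normal(scale=0.1, size=(d, d))
    P = {g: {"hL": mat(), "hR": mat(), "cL": mat(), "cR": mat(), "b": np.zeros(d)}
         for g in ("i", "fl", "fr")}
    P["x"] = {"hL": mat(), "hR": mat(), "b": np.zeros(d)}
    P["o"] = {"hL": mat(), "hR": mat(), "c": mat(), "b": np.zeros(d)}
    return P

def gate(p, hL, hR, cL, cR):
    """Sigmoid gate over the four child sources (hidden and cell vectors)."""
    return sigmoid(p["hL"] @ hL + p["hR"] @ hR + p["cL"] @ cL + p["cR"] @ cR + p["b"])

def slstm_forward(P, hL, cL, hR, cR):
    """One S-LSTM memory block with two children; returns (h_t, c_t)."""
    i = gate(P["i"], hL, hR, cL, cR)                          # Eq. (1)
    fl = gate(P["fl"], hL, hR, cL, cR)                        # Eq. (2)
    fr = gate(P["fr"], hL, hR, cL, cR)                        # Eq. (3)
    x = P["x"]["hL"] @ hL + P["x"]["hR"] @ hR + P["x"]["b"]   # Eq. (4)
    c = fl * cL + fr * cR + i * np.tanh(x)                    # Eq. (5)
    o = sigmoid(P["o"]["hL"] @ hL + P["o"]["hR"] @ hR
                + P["o"]["c"] @ c + P["o"]["b"])              # Eq. (6)
    h = o * np.tanh(c)                                        # Eq. (7)
    return h, c

rng = np.random.default_rng(0)
d = 5  # hypothetical hidden size
hL, cL, hR, cR = (rng.normal(size=d) for _ in range(4))
P = init_params(d, rng)
h, c = slstm_forward(P, hL, cL, hR, cR)
```

When the left forget gate saturates near 1 while the right forget gate and the input gate saturate near 0, the cell vector `c` is approximately a copy of `cL`, which is the pass-through behavior the gating is meant to allow.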


Backpropagation over structures. During training, the gradient of the objective function with respect to each parameter can be calculated efficiently via backpropagation over structures (Goller & Küchler, 1996; Socher et al., 2013). The major difference from that of (Socher et al., 2013) is that we use LSTM-like backpropagation where, unlike in a regular LSTM, the pass of error needs to discriminate between the left and right children or, in a topology with more than two children, between all children. Obtaining the backprop formulas is tedious, but we list them below to facilitate duplication of our work.^ We will discuss the specific objective function later in the experiments. For each memory block, assume that the error passed to the hidden vector is $\epsilon^h_t$. The derivatives of the output gate $\delta^o_t$, left forget gate $\delta^{f_l}_t$, right forget gate $\delta^{f_r}_t$, and input gate $\delta^i_t$ are computed as:

$$\epsilon^h_t = \frac{\partial O}{\partial h_t} \qquad (8)$$

$$\delta^o_t = \epsilon^h_t \otimes \tanh(c_t) \otimes \sigma'(o_t) \qquad (9)$$

$$\delta^{f_l}_t = \epsilon^c_t \otimes c^L_{t-1} \otimes \sigma'(f^L_t) \qquad (10)$$

$$\delta^{f_r}_t = \epsilon^c_t \otimes c^R_{t-1} \otimes \sigma'(f^R_t) \qquad (11)$$

$$\delta^i_t = \epsilon^c_t \otimes \tanh(x_t) \otimes \sigma'(i_t) \qquad (12)$$

where $\sigma'(x)$ is the element-wise derivative of the logistic function over vector $x$. Since it can be computed with the activation of $x$, we abuse the notation a bit to write it over the activated vectors in these equations. $\epsilon^c_t$ is the derivative over the cell vector. So if the current node is the left child of its parent, we use Equation (13) to calculate $\epsilon^c_t$; otherwise Equation (14) is used:

$$\epsilon^c_t = \epsilon^h_t \otimes o_t \otimes g'(c_t) + \epsilon^c_{t+1} \otimes f^L_{t+1} + (W^L_{ci})^T \delta^i_{t+1} + (W^L_{cf_l})^T \delta^{f_l}_{t+1} + (W^L_{cf_r})^T \delta^{f_r}_{t+1} + (W_{co})^T \delta^o_t \qquad (13)$$

$$\epsilon^c_t = \epsilon^h_t \otimes o_t \otimes g'(c_t) + \epsilon^c_{t+1} \otimes f^R_{t+1} + (W^R_{ci})^T \delta^i_{t+1} + (W^R_{cf_l})^T \delta^{f_l}_{t+1} + (W^R_{cf_r})^T \delta^{f_r}_{t+1} + (W_{co})^T \delta^o_t \qquad (14)$$

where $g'(x)$ is the element-wise derivative of the tanh function. It can also be directly calculated from the tanh activation of $x$. The superscript $T$ over the weight matrices means matrix transpose.

^The code will be published at www.icml-placeholder-only.com


With derivatives at each gate computed, the derivatives of the weight matrices used in Equations (1)-(7) can be calculated accordingly, which is omitted here. We checked the correctness of the S-LSTM implementation with the standard approximated gradient approach.
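An approximated gradient check of this kind compares an analytic derivative against central finite differences. The sketch below applies it to a deliberately simplified toy case: the gate activations are held fixed (so their dependence on the children is ignored), the objective is a hypothetical linear readout `w @ h`, and only the gradient with respect to the left child's cell vector is checked. It illustrates the checking technique, not the paper's full backward pass.

```python
import numpy as np

def cell_and_hidden(cL, cR, fl, fr, i, x, o):
    """Cell update and hidden output, with gate activations held fixed."""
    c = fl * cL + fr * cR + i * np.tanh(x)
    h = o * np.tanh(c)
    return c, h

def loss(cL, cR, fl, fr, i, x, o, w):
    """Toy scalar objective: a fixed linear readout of the hidden vector."""
    _, h = cell_and_hidden(cL, cR, fl, fr, i, x, o)
    return float(w @ h)

def analytic_grad_cL(cL, cR, fl, fr, i, x, o, w):
    """d loss / d cL = eps_h * o * tanh'(c) * fl, where eps_h = w here."""
    c, _ = cell_and_hidden(cL, cR, fl, fr, i, x, o)
    return w * o * (1.0 - np.tanh(c) ** 2) * fl

def numeric_grad_cL(cL, cR, fl, fr, i, x, o, w, eps=1e-5):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(cL)
    for k in range(cL.size):
        step = np.zeros_like(cL)
        step[k] = eps
        up = loss(cL + step, cR, fl, fr, i, x, o, w)
        down = loss(cL - step, cR, fl, fr, i, x, o, w)
        grad[k] = (up - down) / (2 * eps)
    return grad

rng = np.random.default_rng(1)
d = 6  # hypothetical vector size
cL, cR, x, w = (rng.normal(size=d) for _ in range(4))
fl, fr, i, o = (rng.uniform(size=d) for _ in range(4))  # gate values in [0, 1)

max_gap = np.max(np.abs(analytic_grad_cL(cL, cR, fl, fr, i, x, o, w)
                        - numeric_grad_cL(cL, cR, fl, fr, i, x, o, w)))
```

A small `max_gap` indicates that the analytic formula agrees with the numerical estimate; the same comparison can be repeated for each parameter of a full implementation.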


Objective over trees. The objective function defined over structures can be complicated, and could take the output structures into account depending on the properties of the problem. Following (Socher et al., 2013), the overall objective function we used to learn S-LSTM in this paper simply minimizes the overall cross-entropy error, summed over all nodes.


4. Experiment Set-up 


As discussed earlier, recursion is a basic process inherent to many problems. In this paper, we leverage the proposed model to solve semantic composition for the meanings of pieces of text, a fundamental problem in understanding human languages.


We specifically attempt to determine the sentiment of different granularities of phrases in a tree, within the Stanford Sentiment Tree Bank benchmark data (Socher et al., 2013). In obtaining the sentiment of a long piece of text, early work often factorized the problem to consider smaller pieces of component words or phrases with bag-of-words or bag-of-phrases models (Pang & Lee, 2008; Liu & Zhang, 2012). More recent work has started to model composition (Moilanen & Pulman, 2007; Choi & Cardie, 2008; Socher et al., 2012; 2013; Kalchbrenner et al., 2014), a more principled approach to modeling the formation of semantics. In this paper, we put the proposed LSTM memory blocks at tree nodes: we replaced the tensor-enhanced composition layer at each tree node presented in (Socher et al., 2013) with an S-LSTM memory block. We used the same dataset, the Stanford Sentiment Tree Bank, to evaluate the performance of the models. In addition to being a benchmark for much previous work, the data provide human annotations at all nodes of the trees, facilitating a more comprehensive exploration of the properties of S-LSTM.

Default setting. In the default setting, we conducted experiments as in (Socher et al., 2013). Table 1 shows the accuracies of different models on the test set of the Stanford Sentiment Tree Bank. We present the results on 5-category sentiment prediction at both the sentence level (i.e., the ROOTS column in the table) and for all phrases including roots (the PHRASES column).^ In Table 1, NB and SVM are naive Bayes and support vector machine classifiers, respectively; RvNN corresponds to RNN in (Socher et al., 2013). As described earlier, we refer to recursive neural networks as RvNN to avoid confusion with recurrent neural networks. RNTN differs from RvNN in that, when merging two nodes to obtain the hidden vector of their parent, a tensor is used to obtain the second-degree polynomial interactions.
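The difference between the two composition functions can be made concrete with a short sketch. It assumes the tensor composition in the form described in (Socher et al., 2013), where each output dimension adds a bilinear term over the concatenated children to the usual linear term; bias terms are omitted, and the sizes and names (`rntn_compose`, `V`, `W`) are illustrative assumptions.

```python
import numpy as np

def rntn_compose(a, b, V, W):
    """p = tanh(ab^T V ab + W ab) with ab = [a; b]; V has shape (d, 2d, 2d)."""
    ab = np.concatenate([a, b])
    # One bilinear form per output dimension k: sum_ij ab_i * V[k, i, j] * ab_j
    bilinear = np.einsum("i,kij,j->k", ab, V, ab)
    return np.tanh(bilinear + W @ ab)

rng = np.random.default_rng(2)
d = 3  # hypothetical vector size
a, b = rng.normal(size=d), rng.normal(size=d)
V = rng.normal(scale=0.01, size=(d, 2 * d, 2 * d))
W = rng.normal(scale=0.1, size=(d, 2 * d))
p = rntn_compose(a, b, V, W)
```

Setting `V` to zero recovers the plain RvNN composition (without bias). The tensor `V` alone holds `d * (2d) * (2d)` entries, so its size grows cubically with the vector dimension.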


4.1. Data Set 


The Stanford Sentiment Tree Bank (Socher et al., 2013) contains about 11,800 sentences from the movie reviews that were originally discussed in (Pang & Lee, 2005). The sentences were parsed with the Stanford parser (Klein & Manning, 2003). Phrases at all the tree nodes were manually annotated with sentiment values. We use the same split of the training and test data as in (Socher et al., 2013) to predict the sentiment categories of the roots (sentences) and all phrases (including sentences). For the root sentiment, the training, development, and test sets contain 8544, 1101, and 2210 sentences, respectively. The phrase sentiment task includes 318582, 41447, and 82600 phrases for the three sets. Following (Socher et al., 2013), we also use classification accuracy to measure performance.


4.2. Training Details 

As mentioned before, we follow (Socher et al., 2013) to minimize the cross-entropy error for all nodes or for roots only, depending on specific experiment settings. For all phrases, the error is calculated as a regularized sum:

$$E(\theta) = -\sum_i \sum_j t^i_j \log y^i_j + \lambda \|\theta\|_2^2 \qquad (15)$$

where $y^i \in \mathbb{R}^{c \times 1}$ is the predicted distribution and $t^i \in \mathbb{R}^{c \times 1}$ the target distribution, $c$ is the number of classes or categories, and $j \in c$ denotes the $j$-th element of the multinomial target distribution; $i$ iterates over nodes, $\theta$ are model parameters, and $\lambda$ is a regularization parameter. We tuned our model against the development data set as split in (Socher et al., 2013).
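As a small worked illustration of the regularized sum, the sketch below computes a cross-entropy error over a list of nodes plus an L2 penalty on a flat list of parameters. The two-node example, the probability values, and the names (`node_cross_entropy`, `tree_objective`) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math

def node_cross_entropy(target, predicted):
    """Cross-entropy between a target and a predicted distribution at one node."""
    return -sum(t * math.log(y) for t, y in zip(target, predicted) if t > 0)

def tree_objective(targets, predictions, params, lam):
    """Sum of node errors plus an L2 penalty lam * ||theta||^2."""
    data_term = sum(node_cross_entropy(t, y) for t, y in zip(targets, predictions))
    penalty = lam * sum(p * p for p in params)
    return data_term + penalty

# Two nodes with 5 sentiment classes each; targets are one-hot here.
targets = [[0, 0, 1, 0, 0], [0, 0, 0, 1, 0]]
predictions = [[0.1, 0.1, 0.6, 0.1, 0.1], [0.2, 0.2, 0.2, 0.2, 0.2]]
error = tree_objective(targets, predictions, params=[0.5, -0.5], lam=0.01)
```

With one-hot targets, each node contributes the negative log-probability assigned to its gold class, so confident correct predictions lower the error and the penalty discourages large parameter values. Restricting `targets` and `predictions` to root nodes gives the roots-only variant.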


5. Results 

To understand the modeling advantages of S-LSTM over 
the structures, we conducted four sets of experiments. 


Table 1. Performances (accuracies) of different models on the test set of the Stanford Sentiment Tree Bank, at the sentence level (roots) and the phrase level. † indicates performance that is statistically significantly better (p < 0.05) than the corresponding models.

Models    ROOTS   PHRASES
NB        41.0    67.2
SVM       40.7    64.3
RvNN      43.2    79.0
RNTN      45.7    80.7
S-LSTM    48.0†   81.9†


Table 1 shows that S-LSTM achieved the best predictive performance when compared to all the models reported in (Socher et al., 2013). The S-LSTM results reported here were obtained by setting the size of the hidden units to 100, the batch size to 10, and the learning rate to 0.1. In our experiments, we only tuned these hyper-parameters, and we expect that finer tuning, such as discriminating the classification weights between the leaves (word embeddings) and other nodes, using different numbers of hidden units for the memory blocks (e.g., for the hidden layers of words), or different initializations of word embeddings, may further improve the performance reported here.


To evaluate the S-LSTM model's convergence behavior, Figure 3 depicts the converging time during training. More specifically, we show two sub-figures: one for roots (upper sub-figure) and the other for all phrases (lower sub-figure). From these figures, we can observe that S-LSTM converges faster than the RNTN. For instance, for the phrase-level task, S-LSTM started to converge after about 20 minutes, but the RNTN needed over 180 minutes. S-LSTM has far fewer parameters than RNTN, and the forward and backward propagation can be computed efficiently.

^The Stanford CoreNLP package (http://nlp.stanford.edu/sentiment/code.html) only gives approximate accuracies for 2-category sentiment, which are not included here in the table.




Figure 3. Converging time during training for roots (the upper figure) and for all nodes (the lower figure).


More real-life settings. We further compare S-LSTM with RNTN in two more experimental settings. In the first setting, we only keep the training signals at the roots to train S-LSTM and RNTN, depicted as models (1) and (2) in Table 2. ROOT LBLS beside the model names stands for root labels; that is, only the gold labels at the sentence level are used to train the model. In most sentiment analysis circumstances, phrase-level annotations are not available: most nodes in a tree are fragments that may not be that interesting, e.g., the fragment "of a good movie".** Also, annotating all phrases is expensive. However, these should not be regarded as comments on the value of the Sentiment Tree Bank. Detailed annotations in the tree bank make much interesting work possible, e.g., the study of the effect of negation in changing sentiment (Zhu et al., 2014).

The second setting, corresponding to models (3) and (4) in Table 2, is only slightly different: we keep the annotations for the tree leaves as well, to simulate that a sentiment lexicon is available and covers all leaves (words), so there is no out-of-vocabulary concern (LEAF LBLS beside the model names stands for leaf labels). Using real sentiment lexicons is expected to yield a performance between the two settings here.

**Phrase-level sentiment analysis is often defined over a very small subset of phrases of interest, such as in the phrase-level task defined in (Wilson et al., 2005; Mohammad et al., 2013).


Results in the table show that in both settings, S-LSTM outperforms RNTN by a large margin. When only root labels are used to train the models, S-LSTM obtains an accuracy of 43.5, compared with 29.1 for RNTN. When the leaf labels are also used, S-LSTM achieves an accuracy of 44.1 and RNTN 34.9. All these improvements are statistically significant (p < 0.05). For the RNTN, without supervising signals from the internal nodes, the composition parameters may not be learned well, potentially because the tensor has many more parameters to learn. On the other hand, through controlling its gates, the S-LSTM shows a very good ability to learn from the trees.

Table 2. Performances of models trained with only root labels (the first two rows) and models that use both root and leaf labels (the last two rows).

Models                          ROOTS
(1) RNTN (Root Lbls)            29.1
(2) S-LSTM (Root Lbls)          43.5†
(3) RNTN (Root + Leaf Lbls)     34.9
(4) S-LSTM (Root + Leaf Lbls)   44.1†


Performance over different levels of trees. In Figure 4 we further depict the performance of models on different levels of nodes in the trees. In the figure, the x-axis corresponds to different depths or lengths and the y-axis is accuracy. The depth here is defined as the longest distance between the root of a phrase and its descendant leaves. The length is simply the number of words of a node; depth does not necessarily equal length, e.g., a balanced tree with 4 leaves has a different depth than an unbalanced tree with the same number of leaves. The trends of the two figures are similar. In both figures, S-LSTM performs better at all depths, showing its advantages on nodes at depth. As the deeper levels of the tree tend to have more complicated syntax and semantics, S-LSTM can model such more complicated syntax and semantics better.
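The two measures can be made concrete with a small sketch. It assumes a hypothetical tree encoding in which a leaf is a string and an internal node is a tuple of children; the function names are illustrative.

```python
def length(tree):
    """Number of words (leaves) under a node; a leaf is a string."""
    if isinstance(tree, str):
        return 1
    return sum(length(child) for child in tree)

def depth(tree):
    """Longest distance from a node to any of its descendant leaves."""
    if isinstance(tree, str):
        return 0
    return 1 + max(depth(child) for child in tree)

# Two binary trees over the same four words.
balanced = (("a", "b"), ("c", "d"))
unbalanced = ((("a", "b"), "c"), "d")
```

Both trees have length 4, but the balanced tree has depth 2 while the unbalanced one has depth 3, which is why accuracy is reported against each measure separately.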

Figure 4. Accuracies at different depths in the trees (the upper figure), or by different lengths of the phrases (the lower figure).

Explicit structures vs. no structures. Some efforts in the literature attempt to learn distributed representations by utilizing input structures when available, and others prefer to assume that chain-structured recurrent neural networks can actually capture the structures implicitly through a linear coding process. In this paper, we attempt to give some empirical evidence in our experiment setting by comparing several different models. First, a special case of the S-LSTM model is considered, in which no sentential structures are given. Instead, words are read from left to right and combined in that order. We call it left recursive S-LSTM, or S-LSTM-LR in short. Similarly, we also experimented with a right recursive S-LSTM, S-LSTM-RR, in which words are read from right to left instead. For these models, phrase-level training signals are not available, as the nodes here do not correspond to those in the original Stanford Sentiment Tree Bank; but the root and leaf annotations are still the same, so we run two versions of our experiments: one uses only training signals from roots and the other also includes leaf annotations.
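One way to picture these two degenerate structures is as binary trees built without a parser. The sketch below reflects one reading of the description, namely that each new word is merged with everything combined so far; the tuple encoding and function names are assumptions made for illustration.

```python
from functools import reduce

def left_recursive(words):
    """Combine words left to right: (((w1, w2), w3), w4)."""
    return reduce(lambda acc, w: (acc, w), words)

def right_recursive(words):
    """Combine words right to left: (w1, (w2, (w3, w4)))."""
    return reduce(lambda acc, w: (w, acc), reversed(words))

words = ["a", "very", "good", "movie"]
lr_tree = left_recursive(words)
rr_tree = right_recursive(words)
```

Both trees keep the original words as leaves and the full sentence at the root, so root and leaf labels remain usable, while the internal nodes no longer line up with the annotated phrases of the parsed trees.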

It can be observed from Table 3 that the given parsing structure helps improve the predictive accuracy. In the case of using only root labels, the left recursive S-LSTM and right recursive S-LSTM have similar performance (40.2 and 40.3, respectively), both inferior to S-LSTM (43.5). When using gold leaf labels, the gaps are smaller, but using the parse structure is still better. Note that in real applications, where out-of-vocabulary issues do arise (i.e., some leaves are not seen in the sentiment dictionaries), the difference between S-LSTM and the recursive versions that do not use the structures is expected to be between the gaps we observed here.

Table 3. Performances of models that do not use the given sentence structures. S-LSTM-LR is a degenerated version of S-LSTM that reads input words from left to right, and S-LSTM-RR reads words from right to left.

Models                            ROOTS
S-LSTM-LR (Root Lbls)             40.2
S-LSTM-RR (Root Lbls)             40.3
S-LSTM (Root Lbls)                43.5†
S-LSTM-LR (Root + Leaf Lbls)      43.1
S-LSTM-RR (Root + Leaf Lbls)      43.2
S-LSTM (Root + Leaf Lbls)         44.1†

6. Conclusions

We aim to extend the conventional chain-structured long short-term memory to explicitly consider structures. In this paper we particularly study tree structures, in which the proposed S-LSTM memory cell can reflect the history memories of multiple descendants through gated copying of memory vectors. The model provides a principled way to consider long-distance interplays over the structures. We leveraged the model to learn distributed sentiment representations for texts, and showed that it outperforms a state-of-the-art recursive model when that model's tensor-enhanced composition layers are replaced with S-LSTM memory blocks. We showed that the structure information is useful in helping S-LSTM achieve state-of-the-art performance.

The research community seems to contain two lines of wisdom: one attempts to learn distributed representations by utilizing structures when available, and the other prefers to believe that recurrent neural networks can actually capture the structures implicitly through a linear-chain coding process. In this paper, we also attempt to give some empirical evidence toward answering the question. At least for the settings of our experiments, the explicit input structures are helpful in inferring the high-level (e.g., root) semantics.